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Abstract 

Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for 
de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is 
coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately 
assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of 
uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate 
the processing of large and complex datasets, and will help in generating biological insights for specific 
environments. Ray Meta is open source and available at http://denovoassembler.sf.net. 
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Background 

While voluminous datasets from high-throughput sequen- 
cing experiments have allowed new biological questions to 
emerge [1,2], the technology's speed and scalability are not 
yet matched by available analysis techniques and the gap 
between them has been steadily growing [3,4]. The de 
Bruijn graph is a structure for storing DNA words - or 
k-mers - that occur in sequence datasets [5,6]. Recent 
work showed that adding colors to a de Bruijn graph can 
allow variants to be called even in the absence of a com- 
plete genome reference [7]. 

The field of metagenomics is concerned with the analy- 
sis of communities by sampling the DNA of all species in 
a given microbial community. The assembly of metagen- 
omes poses greater and more complex challenges than 
single-genome assembly as the relative abundances of the 
species in a microbiome are not uniform [8]. A com- 
pounding factor is the genetic diversity represented by 
polymorphisms and homologies between strains, which 
increases the difficulty of the problem for assemblers [8]. 
Moreover, the underlying diversity of the sample 
increases its complexity and adds to the difficulties of 
assembly. Last but not least, DNA repeats can produce 
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misassemblies [9] in the absence of fine-tuned, accurate 
computational tools [10]. 

The microbial diversity in microbiomes contains the 
promise of finding new genes with novel and interesting 
biological functions [11]. While the throughput in metage- 
nomics is increasing fast, bottlenecks in the analyses are 
becoming more apparent [12], indicating that only equally 
parallel - and perhaps highly distributed - analysis systems 
can help bridge the scalability gap. Parallel sequencing 
requires parallel processing for bioprospecting and for 
making sense of otherwise largely unknown sequences. 

Environmental microbiomes have been the subject of 
several large-scale investigations. Viral genome assemblies 
have been obtained from samples taken from hot springs 
[13]. Metabolic profiling of microbial communities from 
Antarctica [14] and the Arctic [15] provided novel insights 
into the ecology of these communities. Furthermore, a 
new Archaea lineage was discovered in a hypersaline 
environment by means of metagenomic assembly [16]. 
The metabolic capabilities of terrestrial and marine micro- 
bial communities have been compared [17]. The structure 
of communities in the environment has been recon- 
structed [18]. All these studies show that environmental 
microbiomes are reservoirs of genetic novelty [19], which 
bioprospecting aims at discovering. 

Through metagenomic analysis, the interplay between 
host and commensal microbial metabolic activity can be 
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studied, promising to shed light on its role in maintaining 
human health. Furthermore, precisely profiling the human 
microbial and viral flora at different taxonomic levels as 
well as functional profiling may hint at improved new 
therapeutic options [20]. To that end, the human distal 
gut microbiome of two healthy adults was analyzed by 
DNA sequencing [21], and subsequently the human gut 
microbiome of 124 European individuals was analyzed by 
DNA sequencing from fecal samples by the MetaHIT con- 
sortium [22] . Another study proposed that there are three 
stable, location-independent, gut microbiome enterotypes 
[23] . Finally, the structure, function and diversity of the 
healthy human microbiome were investigated by the 
Human Microbiome Project Consortium [24] . 

With 16S rRNA gene sequencing, species representation 
can be extracted by taxonomic profiling [25]. However, 
using more than one marker gene produces better taxo- 
nomic profiles [26,27] . Furthermore, a taxonomy based on 
phylogenetic analyses helps in the process of taxonomic 
profiling [28]. While taxonomic profiles are informative, 
functional profiling is also required to understand the biol- 
ogy of a system. To that end, gene ontology [29] can 
assign normalized functions to data. 

Although not designed for metagenomes, distributed 
software for single genomes, such as ABySS [30] and Ray 
[31], illustrate how leveraging high-performance and par- 
allel computing could greatly speed up the analysis of the 
large amount of data generated by metagenome projects. 
Notably, sophisticated parallel tools are easily deployed 
on cloud computing infrastructures [32] or on national 
computing infrastructures through their use of a cross- 
platform, scalable method called the message-passing 
interface. 

Taxonomic profiling methods utilize alignments 
[26,27,33-36]or hidden Markov models [37] or both[38]. 
Few methods are available for metagenome de novo 
assembly (Meta Velvet [39], Meta-IDBA [40] and Genovo 
[41]), none couples taxonomic and ontology profiling with 
de novo assembly, and none is distributed to provide scal- 
ability. Furthermore, none of the existing methods for 
de novo metagenome assembly distributes memory utiliza- 
tion over more than one compute machine. This addi- 
tional difficulty plagues current metagenome assembly 
approaches. 

The field of metagenomic urgently needs distributed and 
scalable processing methods to tackle efficiently the size of 
samples and the assembly and profiling challenges that 
this poses. Herein we show that Ray Meta, a distributed 
processing application, is suited for metagenomics. We 
present results obtained by de novo metagenome assembly 
with coupled profiling. With Ray Meta, we show that the 
method scales for two metagenomes simulated to incorpo- 
rate sequencing errors: a 100-genome metagenome 
assembled from 400 x 10 101-nucleotide reads and a 



1,000-genome metagenome assembled from 3 x 10 9 100- 
nucleotide reads. Ray Communities utilizes bacterial 
genomes to color the assembled de Bruijn graph. The 
Greengenes taxonomy [28] was utilized to obtain the 
profiles from colored k-mers. Other taxonomies, such as 
the NCBI taxonomy, can be substituted readily. We also 
present results obtained by de novo metagenome assembly 
and taxonomic and functional profiling of 124 gut micro- 
biomes. We compared Ray Meta to MetaVelvet and 
validated Ray Communities with MetaPhlAn taxonomic 
profiles. 

Results 

Scalability 

In order to assess the scalability of Ray Meta, we simulated 
two large datasets. Although a simulation does not capture 
all genetic variations (and associated complexity) occurring 
in natural microbial populations, it is a way to validate the 
correctness of assemblies produced by Ray Meta and the 
abundances predicted by Ray Communities. The first data- 
set contained 400 x 10 reads, with 1% as human contami- 
nation. The remaining reads were distributed across 100 
bacterial genomes selected randomly from GenBank. The 
read length was 101 nucleotides, the substitution error rate 
was 0.25% and reads were paired. Finally, the proportion of 
bacterial genomes followed a power law (with exponent 
-0.5) to mimic what is found in nature (see the section on 
Materials and methods). The number of reads for this 100- 
genome metagenome roughly corresponds to the number 
of reads generated by one lane of an Illumina HiSeq 2000 
flow cell (Illumina, Inc.). Table SI in Additional file 1 lists 
the number of reads for each bacterial genome and for the 
human genome. This dataset was assembled by Ray Meta 
using 128 processor cores in 13 hours, 26 minutes, with an 
average memory usage of 2 GB per core. The resulting 
assembly contained 22,162 contigs with at least 100 
nucleotides and had an N50 of 152,891. The sum of contig 
lengths was 345,945,478 nucleotides. This is 93% of the 
sum of bacterial genome lengths, which was 371,623,377 
nucleotides. Therefore, on average there were 3,459,454 
assembled nucleotides and 221 contigs per bacterial gen- 
ome, assuming that the bacterial genomes were roughly of 
the same size and same complexity and that the coverage 
depth was not sufficient to assemble incorporated human 
contamination. Using the known reference sequences, we 
validated the assembly using MUMmer to assess the qual- 
ity. There were 11,220 contigs with at least 500 nucleotides. 
Among these, 152 had misassemblies (1.35%). Any contig 
that did not align as one single maximum unique match 
with a breadth of coverage of at least 98.0% was marked as 
misassembled. The number of mismatches was 1,108 while 
the number of insertions or deletions was 597. 

To further investigate the scalability of our approach 
for de novo metagenome assembly, we simulated a 
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second metagenome. This one contained 1,000 bacterial 
genomes randomly selected from GenBank as well as 
1% of human sequence contamination. The proportion 
of the 1,000 bacterial genomes was distributed according 
to a power law (with exponent -0.3) and the number of 
reads was 3 x 10 9 (Table S2 in Additional file 1). This 
number of reads is currently generated by one Illumina 
HiSeq 2000 flow cell (Illumina, Inc.). This second data- 
set, which is larger, was assembled de novo by Ray Meta 
in 15 hours, 46 minutes using 1,024 processor cores 
with an average memory usage of 1.5 GB per core. It 
contained 974,249 contigs with at least 100 nucleotides; 
N50 was 76,095 and the sum of the contig lengths was 
2,894,058,833, or 80% of the sum of bacterial genome 
lengths (3,578,300,288 nucleotides). Assuming a uniform 
distribution of assembled bases and contigs and that 
human sequence coverage depth was not sufficient for 
its de novo assembly, there were, on average, 974 contigs 
and 2,894,058 nucleotides per bacterial genome. To 



validate whether or not the produced contigs were of 
good quality, we compared them to the known refer- 
ences. There were 196,809 contigs with at least 500 
nucleotides. Of these, 2,638 were misassembled (1.34%) 
according to a very stringent test. There were 59,856 
mismatches and 13,122 insertions or deletions. 

Next, we sought to quantify the breadth of assembly 
for the bacterial genomes in the 1,000-genome dataset. 
In other words, the assembled percentage was calculated 
for each genome present in the 1,000-genome metagen- 
ome. Many of these bacterial genomes had a breadth of 
coverage (in the de novo assembly) greater than 95% 
(Figure 1). 

Estimating bacterial proportions 

Another problem that can be solved with de Bruijn graphs 
is estimating the genome nucleotide proportion within a 
metagenome. Using Ray Communities, the 100-genome 
and 1,000-genome datasets de novo assembled de Bruijn 
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Figure 1 Assembled proportions of bacterial genomes for a simulated metagenome with sequencing errors 3x10 100-nucleotide 
reads were simulated with sequencing errors (0.25%) from a simulated metagenome containing 1,000 bacterial genomes with proportions 
following a power law. Having 1,000 genomes with power law proportions makes it impossible to classify sequences with their coverage. This 
large metagenomic dataset was assembled using distributed de Bruijn graphs and profiled with colored de Bruijn graphs. Highly similar, but 
different genomes, are likely to be hard to assemble. This figure shows the proportion of each genome that was assembled de novo within the 
metagenome. Of the bacterial genomes, 88.2% were assembled with a breadth of coverage of at least 80.0%. 
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graphs were colored using all sequenced bacterial genomes 
(Table S4 in Additional file 1) in order to identify contigs 
and to estimate bacterial proportions in the datasets. Ray 
Communities estimates proportions by demultiplexing 
k-mer coverage depth in the distributed de Bruijn graph 
(see the section on Demultiplexing signals from similar bac- 
terial strains in Materials and methods). Because coloring 
occurs after de novo assembly has completed, the reference 
sequences are not needed for assembling metagenomes. 

For the 100-genome dataset, only two bacterial genome 
proportions were not estimated correctly. The first was 
due to a duplicate in GenBank and the second to two 
almost identical genomes (Figure 2A). When two identi- 
cal genomes are provided as a basis to color the de Bruijn 
graph, no k-mer is uniquely colored for any of these two 
genomes, and identifying k-mers cannot be found 
through demultiplexing. This can be solved by using a 
taxonomy, which allows reference genomes to be similar 
or identical. 

In the 1,000-genome dataset, four bacterial genome pro- 
portions were overestimated and 20 were underestimated 
(Figure 2B). In both the 100-genome and 1,000-genome 
datasets, the proportion of bacterial genomes with incor- 
rect estimates was 2.0%. In both of these, the incorrect 
estimates were caused by either duplicated genomes, iden- 
tical genomes or highly similar genomes. The use of a tax- 
onomy alleviates this problem. 

The results with the 100-genome and 1,000-genome 
datasets show that our method can recover bacterial gen- 
ome proportions when the genome sequences are known. 
In real microbiome systems, there is a sizable proportion 
of unknown bacterial species. For this reason, it is impor- 
tant to devise a system that can also accommodate 
unknown species by using a taxonomy, which allows the 
classification to occur at higher levels - such as phylum or 
genus instead of species. 

Metagenome de novo assembly of real datasets 

Here, we present results for 124 fecal samples from a pre- 
vious study [22]. From the 124 samples, 85 were from 
Denmark (all annotated as being healthy) and 39 were 
from Spain (14 were healthy, 21 had ulcerative colitis and 
4 had Crohn's disease). Each metagenome was assembled 
independently (Table S3 in Additional file 1) and the 
resulting distributed de Bruijn graphs were colored to 
obtain taxonomic and gene ontology profiles (see Materi- 
als and methods and Table S4 in Additional file 1). 

These samples contained paired 75-nucleotide and/or 
44-nucleotide reads obtained with Illumina Genome Ana- 
lyzer sequencers. In about 5 hours, 122 samples were 
assembled (and profiled) using 32 processor cores and the 
two remaining samples, namely MH0012 and MH0014, 
were assembled (and profiled) with 48 and 40 processor 
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Figure 2 Estimated bacterial genome proportions For the two 

simulated metagenomes (100 and 1,000 bacterial genomes, 
respectively), colored de Bruijn graphs were utilized to estimate the 
nucleotide proportion of each bacterial genome in its containing 
metagenome. Genome proportions in metagenomes followed a 
power law. Black lines show the expected nucleotide proportion for 
bacterial genomes while blue points represent proportions 
measured by colored de Bruijn graphs. (A) For the 100-genome 
metagenome, only two bacterial genomes were not correctly 
measured (2.0%), namely Methanococcus maripaludis X1 and Serratia 
AS9. Methanococcus maripaludis X1 was not detected because it 
was duplicated in the dataset as Methanococcus maripaludis XI, thus 
providing zero uniquely colored k-mers. Serratia AS9 was not 
detected because it shares almost all its k-mers with Serratia AS12. 
(B) For the 1,000-genome metagenome, 4 bacterial genomes were 
overestimated (0.4%) while 20 were underestimated (2.0%). These 
errors were due to highly similar bacterial genomes, hence they did 
not provide uniquely colored k-mers. This problem can be alleviated 
either by using a curated set of reference genomes or by using a 
taxonomy. The remaining 976 bacterial genomes had a measured 
proportion near the expected value. 



cores, respectively (Table S3 in Additional file 1). These 
runtime figures include de novo assembly, graph coloring, 
signal demultiplexing and taxonomic and gene ontology 
profiling, which are all tightly coupled in the process. In 
the next section, taxonomic profiles are presented for 
these 124 gut microbiome samples. 
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Taxonomic profiling 

In metagenomic projects, the bacterial genomes that occur 
in the sample may be unknown at the species level. How- 
ever, it is possible to profile these samples using a taxon- 
omy. The key concept is to classify colored k-mers in a 
taxonomy tree: a k-mer is moved to a higher taxon as long 
as many taxons have the k-mer so it can be classified as 
the nearest common ancestor of the taxons. For example if 
a k-mer is not classified at the species level, it can be classi- 
fied at the genus level and so on. Furthermore, taxonomy 
profiling does not suffer from similarity issues as seen for 
proportions present in samples because k-mers can be clas- 
sified in higher taxons when necessary. 

Accordingly, k-mers shared by several bacterial species 
cannot be assigned to one of them accurately. For this 
reason, the Greengenes taxonomy [28] (version 2011_11) 
was utilized to classify each colored k-mer in a single 
taxon with its taxonomic rank being one of the following: 
kingdom, phylum, class, order, family, genus or species. 
For each sample, abundances were computed at each 
taxonomic rank. At the moment, the most recent and 
accurate taxonomy for profiling taxons in a metagenome 
is Greengenes [28]. We profiled taxons in the 124 gut 
microbiome samples using this taxonomy. We also incor- 
porated the human genome into this taxonomy to profile 
the human abundance in the process. At the phylum 
level, the two most abundant taxons were Firmicutes and 
Bacteroidetes (Figure 3A). The profile of the phylum 
Chordata indicated that two samples contained signifi- 
cantly more human sequences than the average (Figure 
3A). The most abundant genera in the 124 samples were 
Bacteroides and Prevotella (Figure 3B). The taxon Bacter- 
oides is reported more than once because several taxons 
had this name with a different ancestry in the Green- 
genes taxonomy. The genera Prevotella and Butyrivibrio 
had numerous samples with higher counts, indicating 
that the data are bi-modal (Figure 3B). The genus Homo 
had two samples with significantly more abundance 
(Figure 3B). 

Grouping abundance profiles 

It has been proposed that the composition of the 
human gut microbiome of an individual can be classi- 
fied as one of three enterotypes [23]. We profiled 
genera for each of the 124 gut microbiome samples 
to reproduce these three enterotypes. The 124 samples 
(85 from Denmark and 39 from Spain) were analyzed 
using the two most important principal components 
(Figure 4; see Materials and methods). Two clear 
clusters are visible, one enriched for the genus Bacter- 
oides and one for the genus Prevotella. A continuum 
between two enterotypes has also been reported 
recently [42]. 



Profiling of ontology terms 

Gene ontology is a hierarchical classification of normalized 
terms in three independent domains: biological process, 
cellular component and molecular function. Some biologi- 
cal datasets are annotated with gene ontology. Here, we 
used gene ontology to profile the 124 metagenome sam- 
ples based on a distributed colored de Bruijn graph (see 
Materials and methods). First, abundances for biological 
process terms were obtained (Figure 5A). The two most 
abundant terms were metabolic process and transport. 
The terms oxidation-reduction process and DNA 
recombination had numerous sample outliers, which 
indicates that these samples had different biological 
complexity for these terms (Figure 5A). Next, we 
sought to profile cellular component terms in the sam- 
ples. The most abundant term was membrane, fol- 
lowed by cytoplasm, integral to membrane and plasma 
membrane. This redundancy is due to the hierarchical 
structure of gene ontology (Figure 5B). Finally, we 
measured the abundance for molecular function terms. 
The most abundant was ATP binding, which had no 
outliers. The term DNA binding was also abundant. 
However, the latter had outlier samples (Figure 5C). 

Comparison of assemblies 

Three samples from the MetaHIT Consortium [22] - 
MH0006 (ERS006497), MH0012 (ERS006494) and 
MH0047 (ERS006592) - and three samples from the 
Human Microbiome Project Consortium [24] - SRSO 
11098, SRS017227 and SRS018661 - were assembled 
with Meta Velvet [39] and Ray Meta to draw a compar- 
ison. Assembly metrics are displayed in Table 1. The 
average length is higher for Meta Velvet for samples 
ERS006494 and ERS006592. For the other samples, the 
average length is higher for Ray Meta. The N50 length 
is higher for Ray Meta for all samples. For all samples 
but ERS006497, the total length is higher for Ray 
Meta. Although we assembled the 124 samples from 
[22] and 313 samples (out of 764) from the Human 
Microbiome Project [24] with Ray Meta on supercom- 
puters composed of nodes with little memory (24 GB), 
we only assembled a few samples with MetaVelvet 
because a single MetaVelvet assembly requires exclu- 
sive access to a single computer with a large amount 
of available memory (at least 128 GB). Ray Meta pro- 
duced longer contigs and more bases for these six 
samples. The shared content of assemblies produced 
by MetaVelvet and Ray Meta is shown in Table 1. 
A majority of sequences assembled by MetaVelvet and 
Ray Meta are shared. As metagenomic experiments 
will undoubtedly become more complex, Ray Meta will 
gain a distinct advantage owing to its distributed 
implementation. 
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Figure 3 Fast and efficient taxonomic profiling with distributed colored de Bruijn graphs From a previous study, 124 metagenomic samples 
containing short paired reads were assembled de novo and profiled for taxons. The graph coloring occurred once the de Bruijn graph was assembled 
de novo. (A) The taxonomic profiles are shown for the phylum level. The two most abundant phyla were Firmicutes and Bacteroidetes. This is in 
agreement with the literature [22], The abundance of human sequences was also measured. The phylum Chordata had two outlier samples. This 
indicates that two of the samples had more human sequences than the average, which may bias results. (B) At the genus level, the most abundant 
taxon was Bacteroides. This taxon occurred more than once because it was present at different locations within the Greengenes taxonomic tree. Also 
abundant is the genus Prevotella. Furthermore, the later had numerous samples with higher counts, which may help in non-parametric clustering. Two 
samples had higher abundance of human sequences, as indicated by the abundance of the genus Homo. 



Validation of taxonomic profiling 

We compared Ray Communities to MetaPhlAn in order 
to validate our methodology. Taxonomic profiles for 313 
samples (Additional file 2) from the Human Microbiome 
Project [24] were generated with Ray Communities and 
compared to those of MetaPhlAn [27]. The correlations 
are shown in Table 2 for various body sites. Correlations 
are high - for instance the correlations for buccal mucosa 
(46 samples) were 0.99, 0.98, 0.97, 0.98, 0.95 and 0.91 for 
the ranks phylum, class, order, family, genus and species, 
respectively. These results indicate that Ray Communities 



has an accuracy similar to that of MetaPhlAn [27], which 
was utilized by the Human Microbiome Project Consor- 
tium [24]. The correlation at the genus rank for the site 
anterior nares was poor (0.59) because MetaPhlAn classi- 
fied a high number of reads in the genus Propionibacter- 
ium thus yielding a very high abundance while the number 
of k-mer observations classified this way by Ray Commu- 
nities was more moderate. For the body site called stool, 
the correlation at the family rank was weak (0.62) because 
MetaPhlAn utilizes the NCBI taxonomy whereas Ray Com- 
munities utilizes the Greengenes taxonomy, which has 
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Figure 4 Principal component analysis shows two clusters 

Principal component analysis (see Materials and methods) with 
abundances at the genus level yielded two distinct clusters. 
Abundances were obtained with colored de Bruijn graphs. One was 
enriched in the genus Bacteroides while the other was enriched in 
the genus Prevotella. Principal component 1 was linearly correlated 
with the genus Prevotella while principal component 2 was linearly 
correlated with the genus Bacteroides. This analysis suggests that 
there is a continuum between the two abundant genera Bacteroides 
and Prevotella. This interpretation differs from the origina 
publication in which three human gut enterotypes were reported 
[23]. More recently, it has been proposed that there are only two 
enterotypes and individuals are distributed in a continuum between 
the two [42]. 



been shown to be more accurate [28] . Overall, these results 
indicate that Ray Communities yields accurate taxonomic 
abundances using a colored de Bruijn graph. 

Discussion 

Message passing 

Ray Meta is a method for scalable distributed de novo 
metagenome assembly whereas Meta Velvet runs only on 
a single computer. Therefore, fetching data with Meta- 
Velvet is fast because only memory accesses occur. On 
the other hand, Ray Meta runs on many computers. 
Although this is a benefit at first sight, using many com- 
puters requires messages to be sent back and forth in 
order to fetch data. We used 8 nodes totaling 64 proces- 
sor cores (8 processor cores per node) for Human Micro- 
biome Project samples and the observed point-to-point 
latency (within our application, not the hardware latency) 
was around 37 microseconds - this is much more than 
the 100 nanoseconds required for main memory accesses. 
However, by minimizing messages, RayMeta runs in an 
acceptable time and has a scalability unmatched by Meta- 
Velvet while providing superior assemblies (Table 1). 



From Ray to Ray Meta 

For single genomes, peak coverage is required by Ray in 
the k-mer coverage distribution [31]. This is not the case 
for Ray Meta. Moreover, in Ray for single genomes, read 
markers are selected using the peak coverage and mini- 
mum coverage. This process is local to each read path in 
Ray Meta. This is in theory less precise because there are 
fewer coverage values, but in practice it works well as 
shown in this work. In Ray for single genomes, the unique 
k-mer coverage for a seed path (similar to a unitig) is sim- 
ply the peak k-mer coverage for the whole graph whereas 
in Ray Meta the coverage values are sampled from the 
seed path only. 

Algorithms for metagenome assembly 

Notwithstanding the non-scalability of all de novo meta- 
genome assemblers except Ray Meta (Meta Velvet [39], 
Meta-IDBA [40] and Genovo [41]), there are major dif- 
ferences in the algorithms these software tools imple- 
ment, which are unrelated to scalability. 

Genovo is an assembler for 454 reads. It uses a genera- 
tive probabilistic model and applies a series of hill-climbing 
steps iteratively until convergence [41]. For Genovo, the 
largest dataset processed had 311,000 reads. Herein, the 
largest dataset had 3,000,000,000 reads. Meta Velvet and 
Meta-IDBA both partition the de Bruijn subgraph using k- 
mer coverage peaks in the k-mer coverage distribution 
and/or connected components. This process does not 
work well in theory when there is no peak in the coverage 
distributions. Meta Velvet and Meta-IDBA both simplify 
the de Bruijn graph iteratively - this approach, termed 
equivalent transformations, was introduced by Pevzner and 
collaborators [43]. One of the many advantages of using 
equivalent transformations is that the assembled sequences 
grow in length and their number decreases as the algo- 
rithm makes its way toward the final equivalent transfor- 
mation. Equivalent transformations are hard to port to a 
distributed paradigm because the approach requires a 
mutable graph. 

Ray Meta does not modify the de Bruijn subgraph in 
order to generate the assembly. We showed that applying 
a heuristics-guided graph traversal yields excellent assem- 
blies. Furthermore, working with k-mers and their rela- 
tionships directly is more amenable to distributed 
computing because unlike k-mers, contigs are neither reg- 
ular nor small and are hard to load balance on numerous 
processes. 

Taxonomic profiling with k-mers 

For taxonomic profiling, we have shown that Ray Commu- 
nities is accurate when compared to MetaPhlAn (Table 2). 
Our approach consists in building a de Bruijn graph from 
the raw sequencing reads, assembling it de novo, and then 
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Figure 5 Ontology profiling with colored de Bruijn graphs. Gene ontology profiles were obtained by coloring of the graph resulting from de novo 
assembly. Gene ontology has three domains: biological process, cellular component and molecular function. For each domain, only the 1 5 most 
abundant terms are displayed. (A) Ontology terms in the biological process domain were profiled. Some of these have several outlier samples, namely 
oxidation-reduction process and DNA recombination. (B) Ontology profiling for cellular component terms is shown. The most abundant is the 
membrane term. (C) The profile for molecular function terms is shown. Binding functions are the most abundant with ATP binding, nucleotide 
binding and DNA binding in the top three. Next is catalytic activity, which is a general term. More specific catalytic activities are listed. 
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Table 1 Comparison of assemblies produced by 


coloring it with thousands of bacterial genomes in order to 


MetaVelvet and Ray Meta 






obtain an accurate profile of the sequenced metagenome. 




MetaVelvet 


Ray Meta 


Shared 


By using whole genomes instead of a few selected marker 


FR^00fi4Q4 








genes, such as the 16S RNA gene, some biases are removed 


Reads 


37? 147 956 






(like the copy number of a gene). Furthermore, amplifica- 


Scaffolds 3 


50,136 


56,363 




tions in a whole-genome sequencing protocol are not tar- 


Total length (nt) 


1 50,904,880 


1 56,075,852 


130,979,321 


geted toward any particular marker genes, which may 


Average length (nt) 


3,009 


2,769 




remove further biases. A limitation of the method pre- 


NSO pnnth (nt) 

v jw ici ly lj i \i ' L/ 


6,141 


12,1 17 




sented here is that using k-mers alone to compare 


1 nnnpct pnnth I 1 nt) 

LUI lyCoL ICI lyLl 1 \\ \l) 


1 46 549 


570,359 




sequences is highly stringent. On the other hand, aligner- 


FRS006497 








based approaches can accommodate for an identity as low 


Reads 


l??444Q?f) 






as 70% between sequences as sequence reads are usually 


Scaffolds 3 


61,093 


52,194 




mapped to reference bacterial genomes. At the crux of our 


Tnt^l Ipnnth fntl 

1 ULal ICIIVJUI \J 1 Ly 


1 1 3 403 805 


1 1 1,187,163 


94,649,612 


method is the use of uniquely colored k-mers for signal 


Aupnnp Ipnnth fntl 

nVClQUC Id ly LI 1 \\ Wj 


1 856 


2,130 




demultiplexing (see Materials and methods). Sequencing 


N50 length (nt) 


2,778 


5,430 




errors produce erroneous k-mers. One of the advantages of 


1 nnnp^t Ipnnth (nt) 

LUI ILJCTjL ICTI lyLl 1 \\ ii_y 


1 1 5 684 


430,963 




using a de Bruijn graph is that erroneous k-mers have a 


Ri mninn timp /n-minl 
r\u i 1 1 1 1 1 ly li i i ic \\ i.i 1 1 ii \j 


4'34 


10:06 




small probability of being considered in the assembly [31], 


FRS00659? 








hence sequencing errors do not contribute to taxonomic 


Reads 








profiling for assembled sequences. However, alignment- 


Scaffolds 3 


4,358 


9,387 




based approaches will likely a higher sensitivity than k-mer 


Total length (nt) 


19,501,348 


24,687,275 


18,061,386 


based approaches because they are more tolerant to mis- 


A\/pr;anp pnnth Int^ 
rAVCiayc iciiyLii V'' L / 


4474 


2,629 




matches. Yet, the present work showed that metagenome 


NSO pnnth (nt) 

n jw icriiyLii \iiLy 


8 819 


10,277 




profiling is efficiently done with k-mer counting, through 


Longest length (nt) 


87,983 


137,473 




the use of a colored de Bruijn graph [7], and that it is also 


Ri mninn timp fh'min) 
m l.i i ii iii i y li i i ill u i . i i 1 1 1 i / 


0:41 


4:28 




sensitive (Figure 2) and produces results similar to those of 


CDcni i nqo 

JRJU 1 1 U¥0 








MetaPhlAn (Table 2). With this approach, conserved DNA 


Reads 


707 487 773 






regions captured the biological abundance of bacteria in a 


Scaffolds 3 


30,458 


36,130 




sample. A k-mer length of 31 was used to give a high strin- 


Total length (nt) 


60,574,679 


83,736,387 


51,938,031 


gency in the coloring process. The low error rate of the 


A\/pr:anp Ipnnth (ntl 
rAvciayt ici ly li i \\ 11/ 


1 988 


2,317 




sequencing technology enabled the capture of error-free k- 


MSO pnnth (nt) 

n jw iLTiiyui \iiLy 


3 117 


4,961 




mers for most of the genomic regions, meaning that it was 


1 nnnp^t Ipnnth (nt) 

L_LJI lyCTPL Id lyLl 1 \\ 


1 92 898 


222,21 3 




unlikely that a given k-mer occurred in the sequence reads, 


Ri mninn timp frvmin) 
nuiiiiniy liiiic y 1.1 1 1 1 1 1 f 


8'34 


6:38 




in a known genome, but not in the actual sample. 


SRS01 7777 










Reads 


1 39 00? 751 






Validation of assemblies 


Scaffolds 3 

JLu 1 1 \J 1 LJ J 


106 957 


89,953 




Using MUMmer [44], we validated the quality of assem- 


Tnt^l Ipnnth fntl 

1 ULal ICI iy LI 1 \.' ' v 


1 71 700 737 


1 86,958,660 


126,068,148 


blies produced by Ray Meta. The quality test used was 


Average length (nt) 


1,600 


2,078 




very stringent because any contig not aligning as one sin- 


N50 length (nt) 


2,168 


3,771 




gle maximum unique match with a breadth of coverage of 


Longest ength (nt) 


1 02,749 


224,709 




at least 98% was marked as misassembled. In Table 1, the 


Running time (h:min) 


9:00 


7:10 




number of shared k-mers between assemblies produced by 


SRSOl 8661 








MetaVelvet and Ray Meta is shown. Although the overlap 


Reads 


288,475,1 94 






is significant, the k-mers unique to MetaVelvet or Ray 


Scaffolds 3 


30,709 


18,541 
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Total length (nt) 


35,281,226 


36,891,130 


21,659,465 


improvements in sequencing technologies will provide 


Average length (nt) 


1,148 


1,989 




longer reads with higher coverage depths. These advances 


N50 length (nt) 


1,223 


3,794 




will further improve de novo assemblies. 


Longest length (nt) 


111,404 


377,149 






Running time (h:min) 


1:24 


4:42 




Conclusions 


a 0nly scaffolds with a I 


enqth higher or equal to 500 were considered, nt, 


Scalability is a requirement for analyzing large metagen- 


nucleotide. 








ome datasets. We described a new method to assemble 
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Table 2 Correlation of taxonomic abundances produced by MetaPhlAn and Ray Communities 



RnHu <;itp 


Jul 1 IUICTj 


Phvlum 
r 1 i y i u 1 1 1 


Class 


Order 


Fami lv 

l d 1 1 my 




Species 


Anterior nares 


■ ■ 1 5 


0.9 1 


0.92 


0.94 


0.94 


0.59 


0.59 


Attached keratinized gingival 


3 


0.99 


0.94 


0.92 


0.94 


0.84 


0.71 


Buccal mucosa 


1 1 0 


n qq 
u.yy 


n qq 
u.yo 


n Q7 
u.y/ 


n qq 
u.yo 


n 

u.yj 


u.y i 


Left retroauricular crease 


D 


n no 

u.yy 


r\ oo 

u.yy 


r\ on 
u.yy 


r\ oo 
u.yy 


U./ z 


U.03 


Mid vagina 


1 
1 


n on 

u.yy 


r\ oo 

u.yy 


n oo 

u.yy 


r\ oo 

u.yy 


n oo 

u.yy 


n on 

u.yu 


Palatine tonsils 


A 
'\ 


u.yu 


U.oU 


n 70 
u./y 


U.oj 


U.o4 


n 07 
u.y / 


Posterior fornix 


23 


0.99 


0.99 


0.99 


0.99 


0.97 


0.94 


Right retroauricular crease 


6 


0.94 


0.92 


0.93 


0.94 


0.83 


0.91 


Saliva 


3 


0.97 


0.87 


0.88 


0.96 


0.89 


0.95 


Stool 


61 


0.80 


0.81 


0.81 


0.62 


0.92 


0.84 


Subgingival plaque 


5 


0.86 


0.75 


0.76 


0.74 


0.81 


0.93 


Supragingival plaque 


53 


0.94 


0.93 


0.92 


0.88 


0.89 


0.93 


Throat 


6 


0.95 


0.86 


0.87 


0.92 


0.92 


0.80 


Tongue dorsum 


53 


0.93 


0.80 


0.79 


0.84 


0.85 


0.88 


Vaginal introitus 


1 


1.00 


1.00 


0.99 


0.99 


0.99 


0.97 



Total 313 

Pearson's correlation was utilized to compare taxonomic abundance for 313 samples from various body sites [24]. 



(Ray Meta) and profile (Ray Communities) a metagenome 
in a distributed fashion to provide unmatched scalability. 
It computes a metagenome de novo assembly in parallel 
with a de Bruijn graph. The method also yields taxonomic 
profiles by coloring the graph with known references and 
by looking for uniquely colored k-mers to identify taxons 
at low taxonomic ranks or by using the lowest common 
ancestor otherwise. Ray Meta surpassed Meta Velvet [39] 
for de novo assemblies and Ray Communities compared 
favorably to MetaPhlAn [27] for taxonomic profiling. 

While taxonomic and functional profiling remains a 
useful approach to obtain a big picture of a particular 
sample, only de novo metagenome assembly can truly 
enable discovery of otherwise unknown genes or other 
important DNA sequences hidden in the data. 

Materials and methods 

Thorough documentation and associated scripts to 
reproduce our studies are available in Additional file 3 on 
the publisher website or on https://github.com/sebhtml/ 
Paper-Replication-2012. 

Memory model 

Ray Meta uses the message-passing interface. As such, a 
1,024-core job has 1,024 processes running on many com- 
puters. In the experiments, each node had 8 processor 
cores and 24 GB, or 3 GB per core. With the message-pas- 
sing paradigm, each core has its own virtual memory, 
which is protected from any other process. Because the 
data are distributed uniformly using a distributed hash 
table, memory usage for a single process is very low. For 
the 1,024-core job, the maximum memory usage of any 
process was on average 1.5 GB. 



Assemblies 

Metagenome assemblies with profiling were computed 
with Ray v2.0.0 (Additional file 4) on Colosse, a Compute 
Canada resource. Ray is open source software - the license 
is the GNU General Public License, version 3 (GPLv3) - 
and is freely available from http://denovoassembler.source- 
forge.net/ or http://github.com/sebhtml/ray. Ray can be 
deployed on public compute infrastructure or in the cloud 
(see [45] for a review). 

The algorithms implemented in the software Ray were 
heavily modified for metagenome de novo assembly and 
these changes were called Ray Meta. Namely, the coverage 
distribution for k-mers in the de Bruijn graph is not uti- 
lized to infer the average coverage depth for unique geno- 
mic regions. Instead, this value is derived from local 
coverage distributions during the parallel assembly pro- 
cess. Therefore, unlike Meta Velvet [39], Ray Meta does 
not attempt to calculate or use any global k-mer coverage 
depth distribution. 

Simulated metagenomes with a power law 

Two metagenomes (100 and 1,000 genomes, respectively) 
were simulated with abundances following a power law 
(Tables SI and S2 in Additional file 1). Power laws are 
commonly found in biological systems [46]. Simulated 
sequencing errors were randomly distributed, the error 
rate was set at 0.25% and the average insert length was 
400. The second simulated metagenome was assembled 
with 128 8-core computers (1,024 processor cores) inter- 
connected with a Mellanox ConnectX QDR Infiniband 
fabric (Mellanox, Inc.). For the 1,000-genome dataset, 
messages were routed with a de Bruijn graph of degree 32 
and diameter 2 to reduce the latency. 
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Validation of assemblies 

Assembled contigs were aligned onto reference genomes 
using the MUMmer bioinformatics software suite [44]. 
More precisely, deltas were generated with nucmer. 
Using show-coords, any contig not aligning as one single 
maximum with at least 98% breadth of coverage was 
marked as misassembled. Contigs aligning in two parts at 
the beginning and end of a reference were not counted as 
misassembled owing to the circular nature of bacterial 
genomes. Finally, small insertions/deletions and mis- 
matches were obtained with show-SNPs. 

Colored and distributed de Bruijn graphs 

The vertices of a de Bruijn graph are distributed across 
processes called ranks. Here, graph coloring means label- 
ing the vertices of a graph. A different color is added to 
the graph for each reference sequence. Each k-mer in any 
reference sequence is colored with the reference 
sequence color if it occurs in the distributed de Bruijn 
graph. Therefore, any k-mer in the graph has zero, one 
or more colors. First, a k-mer with no colors indicates 
that the k-mer does not exist in the databases provided. 
Second, a k-mer with one color means that this k-mer is 
specific to one and only one reference genome in the 
databases provided while at least two colors indicates 
that the k-mer is not specific to one single reference 
sequence. These reference sequences are assigned to 
leaves in a taxonomic tree. Reference sequences can be 
grouped in independent name spaces. Genome assembly 
is independent of graph coloring. 

Demultiplexing signals from similar bacterial strains 

Biological abundances were estimated using the product of 
the number of k-mers matched in the distributed de 
Bruijn graph by the mode coverage of k-mers that were 
uniquely colored. This number is called the number of k- 
mer observations. The total number of k-mer observations 
is the sum of coverage depth values of all colored k-mers. 
A proportion is calculated by dividing the number of k- 
mer observations by the total. 

Taxonomic profiling 

All bacterial genomes available in GenBank [47] were uti- 
lized for coloring the distributed de Bruijn graphs (Table 
S4 in Additional file 1). Each k-mer was assigned to a 
taxon in the taxonomic tree. When a k-mer has more 
than one taxon color, the coverage depth was assigned to 
the nearest common ancestor. 

Gene ontology profiling 

The de Bruijn graph was colored with coding sequences 
from the EMBL nucleotide sequence database [48] 
(EMBL CDS), which are mapped to gene ontology by 
transitivity using the uniprot mapping to gene ontology 



[49]. For each ontology term, coverage depths of colored 
k-mers were added to obtain the total number of k-mer 
observations. 

Principal component analysis 

Principal component analysis was used to group taxo- 
nomic profiles to produce enterotypes. Data were pre- 
pared in a matrix using the genera as rows and the 
samples as columns. Singular values and left and right 
singular vectors of this matrix were obtained using singu- 
lar value decomposition implemented in R. The right sin- 
gular vectors were sorted by singular values. The sorted 
right singular vectors were used as the new base for the 
re-representation of the genus proportions. The two first 
dimensions were plotted. 

Software implementation 

Ray Meta is distributed software that runs on connected 
computers by transmitting messages over a network 
using the message-passing interface (MPI) and is imple- 
mented in C++. The MPI standard is implemented in 
libraries such as Open-MPI [50] and MPICH2 [51]. On 
each processor core, tasks are divided into smaller ones 
and given to a pool of 32,768 workers (thread pool), 
which are similar to chares in CHARM++ [52]. Each of 
these sends messages to a virtual communicator. The lat- 
ter implements a message aggregation strategy in which 
messages are automatically multiplexed and demulti- 
plexed. The k-mers are stored in a distributed sparse 
hash table which utilizes open addressing (double hash- 
ing) for collisions. Incremental resizing is utilized in this 
hash table when the occupancy exceeds 90% to grow 
tables locally. Smart pointers are utilized in this table to 
perform real-time memory compaction. The software is 
implemented on top of RayPlatform, a development fra- 
mework used to ease the creation of massively distributed 
high-performance computing applications. 

Comparison with MetaVelvet 

Software versions used were: MetaVelvet 1.2.01, Velvet 
1.2.07 and Ray 2.0.0 (with Ray Meta). MetaVelvet was run 
on one processor core. Ray Meta was run on 64 processor 
cores for Human Microbiome Project samples (SRSO 
11098, SRS017227 and SRS018661) and on 48, 32 and 32 
processor cores for MetaHIT samples (ERS006494, 
ERS006497 and ERS006592), respectively. There were 
eight processor cores per node. The running time for 
MetaVelvet is the sum of running times for velveth, vel- 
vetg and meta-velvetg. For MetaVelvet, sequence files 
were filtered to remove any sequence with more than ION 
symbols. The resulting files were shuffled to create files 
with interleaved sequences. The insert size was manually 
provided to MetaVelvet and the k-mer length was set to 
51 as suggested in its documentation. Peak coverages were 
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determined automatically by MetaVelvet. Ray Meta was 
run with a k-mer length of 31. No other parameters were 
required for Ray Meta and sequence files were provided 
without modification to Ray Meta. The overlaps of assem- 
blies produced by MetaVelvet and by Ray Meta were eval- 
uated with Ray using the graph coloring features. No 
mismatches were allowed in k-mers. Overlaps were com- 
puted for scaffolds with at least 500 nucleotides. 

Comparison with MetaPhlAn 

Taxonomic profiles calculated with MetaPhlAn [27] for 
samples from the Human Microbiome Project were 
obtained [24]. Taxonomic profiles were produced by 
Ray Communities for 313 samples (Additional file 2). 
Pearson's correlation was calculated for each body site 
by combining taxon proportions for both methods for 
each taxonomic rank. 

Additional material 
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